Almathera Ten Pack 3: CDPD 3

home *** CD-ROM | disk | FTP | other *** search

/ Almathera Ten Pack 3: CDPD 3 / Almathera Ten on Ten - Disc 3: CDPD3.iso / scope / 201-220 / scopedisk208 / ibem / up100.lzh / Update_techie.doc < prev next >

Wrap

Text File | 1991-03-21 | 4KB | 70 lines

Techie info, or "How Does it Work?" ----------------------------------- Don't read this unless you actually care how this program works. The method is reminiscent of LZSS encoding. I don't know exactly what the SS stands for, but this isn't LZSS anyway so I'm just going to call it LZF (yes, the 'F' stands for Fehdrau; I can't afford vanity plates for my car so this will have to do). Somebody tell me if I'm off-base using L & Z's initials in the name. (Lempel & Ziv, I think their names were. Apologies to them if my memory fails me.) There is a control byte for the next 8 elements, which can either be new bytes (corresponding bit set to 0) or three-byte references to strings in the old program (corresponding bit set to 1). If a string matches only one or two characters in the old file, the individual bytes are stored, since that way it is either 9 (control bit + byte) or 18 (2 x (control_bit + byte)) bits instead of 25 (control bit plus three byte reference). Three or more bytes are stored as a reference, since 25 bits is less than 27+ (3+ x (control bit + byte)). The references are to a 4096 byte window into the old file. They are stored as 12 bits (hi->lo) of location and 12 bits (hi->lo) of length. A reference of 4095,4095 is legal, since access outside of the window is allowed, so long as it STARTS in the window. String accesses outside of file bounds are not allowed and are cut off if they are found. The 4096 byte window pointer starts out at the start of the old file. The window itself is bounded by pointer-1024 and pointer+3071. However, the bounds are modified (never the pointer, just the bounds) so that they are never outside of the file bounds, except with files smaller than 4096 bytes in which the upper bound is out of the file. The pointer increments by a value equal to the length of the longest match if it is between 1 and 32 bytes. Matches longer than 32 bytes (arbitrary limit, seems to work best) move the pointer to the end of that match in the old file. Using this method the dearchiving process can also figure out which 4096 bytes of the old file are in the window. "Seems kinda slow." Yeah. Unlike LZSS and LZH, I can't use a binary search tree to find the longest match because then I'd have to rebuild the tree after every change in the window position. Instead, I had to literally search the entire damn buffer. I got a little clever about it, though, so that if it, say, had a five-byte match already, it wouldn't check the first five bytes of any subsequent string unless the sixth byte matched the given string. That's the best I could think of. Better ideas, anyone? Anyway, using that method it's only slow on different files. A perfect match is really very fast, and close matches are quite respectable. If it goes too slow, the .up file is probably not going to be very small anyway. How well does it work? Quite, usually, for me. It's really meant as a patch program, not a full upgrade deal. The problem with executables is that 9/10 branches are relative, and therefore are changed if anything is added or deleted between the branch and its destination, so the relative offset changes and causes a string break. Similarly, if your compiler uses a register to access a global data area (ie. Manx's use of A4), and you add one single variable at the start of the data area, every reference to a global variable will be different. Too much of that and Update's not going to find many long strings. It's good for, and meant for, fixing that nasty division by zero error, unallocated string, version number, etc, but not changing 500 instances of printf() to fprintf(stderr,). Get it? By the way, if nothing ever matches, the resultant .up file will be 12.5% larger than the original (9 bits per new byte). For the future: Maybe add Huffman to the code. The control bytes, new bytes, and buffer locations should be more or less random. However, lengths will likely cluster. But that's another story.